    Developments from enquiries into the learnability of the pattern languages from positive data

    Abstract: Pattern languages are languages generated from patterns, and were first proposed by Angluin as a non-trivial class that is inferable from positive data [D. Angluin, Finding patterns common to a set of strings, Journal of Computer and System Sciences 21 (1980) 46–62; D. Angluin, Inductive inference of formal languages from positive data, Information and Control 45 (1980) 117–135]. In this paper we chronicle some results that developed from investigations into the inferability of the pattern languages from positive data.
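
    As a small, purely illustrative sketch (not drawn from the surveyed results), the Python snippet below spells out what a pattern language is: a pattern mixes constant symbols with variables, and a string belongs to the pattern's language exactly when each variable can be replaced by a non-empty string, with repeated variables sharing the same replacement. The variable-naming convention (x1, x2, ...) is an assumption of the sketch.

    # A minimal membership check for an Angluin-style pattern language, given as
    # a sketch only: repeated variables must receive the same non-empty substitution.
    import re

    def pattern_to_regex(pattern):
        """Translate a pattern such as ['a', 'x1', 'b', 'x1'] into a regex,
        using backreferences so a repeated variable must match identically."""
        group_of = {}
        parts = []
        for symbol in pattern:
            if symbol.startswith('x'):                        # variable, by this sketch's convention
                if symbol in group_of:
                    parts.append('\\%d' % group_of[symbol])   # backreference to earlier occurrence
                else:
                    group_of[symbol] = len(group_of) + 1
                    parts.append('(.+)')                      # any non-empty substitution
            else:                                             # constant symbol
                parts.append(re.escape(symbol))
        return re.compile('^' + ''.join(parts) + '$')

    if __name__ == '__main__':
        lang = pattern_to_regex(['a', 'x1', 'b', 'x1'])
        print(bool(lang.match('acbc')))   # True:  x1 -> 'c'
        print(bool(lang.match('acbd')))   # False: the repeated variable differs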

    Positive correlation between gene coexpression and positional clustering in the zebrafish genome

    Abstract
    Background: Co-expressing genes tend to cluster in eukaryotic genomes. This paper analyzes the correlation between the proximity of eukaryotic genes and their transcriptional expression patterns in the zebrafish (Danio rerio) genome, using available microarray data and gene annotation.
    Results: The analyses show that neighbouring genes are significantly coexpressed in the zebrafish genome, and that the coexpression level is influenced by intergenic distance and transcription orientation. This is further supported by examining the coexpression level of genes within positional clusters in the neighbourhood model. There is a positive correlation between gene coexpression and positional clustering in the zebrafish genome.
    Conclusion: The study provides another piece of evidence for the hypothesis that coexpressed genes cluster in eukaryotic genomes.
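
    To make this kind of analysis concrete, the following sketch (an assumed data layout, not the paper's actual pipeline) pairs each gene with its nearest neighbour on the same chromosome and compares the mean Pearson correlation of neighbouring expression profiles against randomly drawn gene pairs.

    # A minimal sketch of neighbour-versus-random coexpression, assuming a simple
    # data layout: genes as (gene_id, chromosome, start) tuples and an expression
    # dict mapping gene_id -> list of per-sample expression values.
    # Requires Python 3.10+ for statistics.correlation (Pearson correlation).
    import random
    from statistics import correlation, mean

    def neighbour_vs_random_coexpression(genes, expression, trials=1000):
        """Return (mean correlation of adjacent gene pairs,
                   mean correlation of randomly drawn gene pairs)."""
        ordered = sorted(genes, key=lambda g: (g[1], g[2]))    # by chromosome, then position
        neighbour_r = [
            correlation(expression[a[0]], expression[b[0]])
            for a, b in zip(ordered, ordered[1:])
            if a[1] == b[1]                                    # same chromosome only
        ]
        ids = [g[0] for g in genes]
        random_r = [
            correlation(expression[x], expression[y])
            for x, y in (random.sample(ids, 2) for _ in range(trials))
        ]
        return mean(neighbour_r), mean(random_r)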

    Calibur: a tool for clustering large numbers of protein decoys

    Abstract
    Background: Ab initio protein structure prediction methods generate numerous structural candidates, referred to as decoys. The decoy with the largest number of neighbors within a threshold distance is typically identified as the most representative decoy. However, the clustering of decoys needed for this criterion involves computations whose runtimes are at best quadratic in the number of decoys. As a result, there is currently no tool designed to exactly cluster very large numbers of decoys, which creates a bottleneck in the analysis.
    Results: Using three strategies aimed at enhancing performance (proximate decoys organization, preliminary screening via lower and upper bounds, and outlier filtering), we designed and implemented a software tool for clustering decoys called Calibur. We show empirical results indicating the effectiveness of each of the strategies employed. The strategies are further fine-tuned according to their effectiveness.
    Calibur demonstrated the ability to scale well with respect to increases in the number of decoys. For a sample of approximately 30 thousand decoys, Calibur completed the analysis in one third of the time required when the strategies were not used.
    For practical use, Calibur is able to automatically discover from the input decoys a suitable threshold distance for clustering. Several methods for this discovery are implemented in Calibur; by default a very fast one is used. Using the default method, Calibur reported relatively good decoys in our tests.
    Conclusions: Calibur's ability to handle very large protein decoy sets makes it a useful tool for clustering decoys in ab initio protein structure prediction. As the number of decoys generated by these methods increases, we believe Calibur will become important for progress in the field.
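
    The clustering criterion itself is easy to state; what Calibur avoids is the naive all-pairs computation. The sketch below (not Calibur's algorithm) shows that naive O(n^2) baseline: count, for every decoy, how many other decoys lie within a threshold distance, then return the decoy with the most neighbours. The rmsd argument is a placeholder for any structural distance function.

    # A naive O(n^2) sketch of the "most neighbours within a threshold" criterion;
    # this is the baseline computation whose cost Calibur's strategies reduce.
    # rmsd is assumed to be a user-supplied distance between two structures.
    def most_representative_decoy(decoys, rmsd, threshold):
        neighbour_counts = [0] * len(decoys)
        for i in range(len(decoys)):
            for j in range(i + 1, len(decoys)):        # every pair is examined once
                if rmsd(decoys[i], decoys[j]) <= threshold:
                    neighbour_counts[i] += 1
                    neighbour_counts[j] += 1
        best = max(range(len(decoys)), key=neighbour_counts.__getitem__)
        return decoys[best], neighbour_counts[best]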

    Large-scale 3D chromatin reconstruction from chromosomal contacts

    Abstract
    Background: Recent advances in genome analysis have established that chromatin has preferred 3D conformations, which bring distant loci into contact. Identifying these contacts is important for understanding possible interactions between these loci. This has motivated the creation of the Hi-C technology, which detects long-range chromosomal interactions. Distance geometry-based algorithms, such as ChromSDE and ShRec3D, have been able to use Hi-C data to infer 3D chromosomal structures. However, these algorithms, being matrix-based, are space- and time-consuming on very large datasets. At 100-kilobase resolution, the human genome involves ∼30,000 loci, requiring gigabytes just to store the matrices.
    Results: We propose a succinct representation of the distance matrices that tremendously reduces the space requirement. We give a complete solution, called SuperRec, for inferring chromosomal structures from Hi-C data by iteratively solving the large-scale weighted multidimensional scaling problem.
    Conclusions: SuperRec runs faster than earlier systems without compromising result accuracy. The SuperRec package can be obtained from http://www.cs.cityu.edu.hk/~shuaicli/SuperRec
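
    The distance-geometry idea the abstract refers to can be sketched briefly (this is not SuperRec's solver): contact frequencies are converted into target distances, for example d_ij proportional to 1 / f_ij^alpha, and the loci are then embedded in 3D, here with classical multidimensional scaling on a dense matrix. The conversion exponent and the dense-matrix formulation are assumptions of the sketch; the point of SuperRec is precisely to avoid storing such matrices in full.

    # A dense-matrix sketch of distance-geometry reconstruction (not SuperRec):
    # turn contact frequencies into target distances, then recover 3D coordinates
    # with classical multidimensional scaling.
    import numpy as np

    def contacts_to_coordinates(contacts, alpha=1.0, eps=1e-9):
        """contacts: symmetric (n x n) array of contact frequencies.
        Returns an (n x 3) array of inferred 3D coordinates."""
        dist = 1.0 / np.power(contacts + eps, alpha)   # higher contact -> shorter distance
        np.fill_diagonal(dist, 0.0)
        n = dist.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n            # centring matrix
        B = -0.5 * J @ (dist ** 2) @ J                 # double-centred squared distances
        w, v = np.linalg.eigh(B)
        top = np.argsort(w)[::-1][:3]                  # three largest eigen-directions
        return v[:, top] * np.sqrt(np.maximum(w[top], 0.0))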

    On triangle inequalities of correlation-based distances for gene expression profiles

    Abstract
    Background: Distance functions are fundamental for evaluating the differences between gene expression profiles. Such a function outputs a low value if the profiles are strongly correlated, either negatively or positively, and vice versa. One popular distance function is the absolute correlation distance, d_a = 1 - |ρ|, where ρ is a similarity measure such as the Pearson or Spearman correlation. However, the absolute correlation distance fails to fulfill the triangle inequality, which would have guaranteed better performance at vector quantization, allowed fast data localization, and accelerated data clustering.
    Results: In this work, we propose d_r = √(1 - |ρ|) as an alternative. We prove that d_r satisfies the triangle inequality when ρ represents the Pearson correlation, Spearman correlation, or cosine similarity. We show d_r to be better than d_s = √(1 - ρ²), another variant of d_a that satisfies the triangle inequality, both analytically and experimentally. We empirically compared d_r with d_a in gene clustering and sample clustering experiments on real-world biological data. The two distances performed similarly in both gene clustering and sample clustering under hierarchical clustering and PAM (partitioning around medoids) clustering. However, d_r demonstrated more robust clustering. In a bootstrap experiment, d_r generated robust sample pair partitions more frequently (P-value < 0.05). The statistics on the time at which a class "dissolved" also support the advantage of d_r in robustness.
    Conclusion: d_r, as a variant of the absolute correlation distance, satisfies the triangle inequality and enables more robust clustering.
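
    The three distances are simple enough to check empirically. The sketch below compares d_a, d_s, and d_r on random profiles, counting triples that violate the triangle inequality; any violations found should come from d_a alone, since the abstract states the other two satisfy the inequality. The use of short Gaussian profiles and Pearson correlation is an assumption of the sketch.

    # An empirical check (a sketch, not the paper's proof) of the triangle
    # inequality for the three correlation-based distances from the abstract.
    import numpy as np

    def d_a(rho): return 1.0 - abs(rho)                   # absolute correlation distance
    def d_s(rho): return float(np.sqrt(1.0 - rho ** 2))
    def d_r(rho): return float(np.sqrt(1.0 - abs(rho)))   # the proposed variant

    def triangle_violations(dist, trials=20_000, samples=5, seed=0):
        """Count random profile triples (x, y, z) with
        dist(x, z) > dist(x, y) + dist(y, z), using Pearson correlation."""
        rng = np.random.default_rng(seed)
        pearson = lambda a, b: float(np.clip(np.corrcoef(a, b)[0, 1], -1.0, 1.0))
        violations = 0
        for _ in range(trials):
            x, y, z = rng.standard_normal((3, samples))
            if dist(pearson(x, z)) > dist(pearson(x, y)) + dist(pearson(y, z)) + 1e-12:
                violations += 1
        return violations

    if __name__ == '__main__':
        for name, dist in [('d_a', d_a), ('d_s', d_s), ('d_r', d_r)]:
            print(name, triangle_violations(dist))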